All files should be knit and compiled using R Markdown. Knit early and often! I do not recommend waiting until the end of the HW to knit.
All questions should be answered completely, and, wherever applicable, code should be included.
If you work with a partner or group, please write the names of your teammates.
Copying and pasting of code is a violation of the Skidmore honor code.
Return to the Lahman package in R, and we’ll use the Batting data frame. Type ?Batting for specific insight into each variable. Primarily, it’s a table with 22 batting metrics. For all questions, we’ll be using the Batting_1 data frame.
library(tidyverse)
library(Lahman)
Batting_1 <- Batting %>%
  filter(yearID >= 2000) %>%
  select(playerID, yearID, AB:SO) %>%
  filter(AB >= 500)
Batting_1
Batting_1: that is, provide its dimensions, and what each row in the data set corresponds to.
Batting_1 contains the batting metrics for every player-season from 2000 onward with at least 500 at-bats. Each row corresponds to one player in one season (more precisely, one stint with one team, since Lahman records a traded player's season as separate rows) and includes metrics such as runs, hits, walks, and strikeouts. The data frame has 2003 rows and 13 columns: playerID, yearID, and the 11 batting columns from AB through SO.
Teams data set – as in our labs and prior homework – we often filtered by year. In the Batting data set, we are filtering by year and requiring an at-bat minimum. Why is this second step often required when working with players but not when working with teams?
Teams play roughly the same number of games within a season, so team totals are already on a comparable footing. Players are not: without a minimum number of at-bats, the data would include many rows built on very little playing time, and those rows can skew the analysis. For example, minor league players are often called up to the majors, struggle, and are sent back down. When this happens several times a season across many teams, it adds a few hundred player rows with very low totals that are unhelpful for data analysis.
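To see why a playing-time minimum matters, the simulation below is a minimal sketch and not part of the assignment. It assumes every hitter has the same true .260 batting average (a hypothetical value) and compares observed averages over 20 at-bats with those over 550 at-bats.

```r
# Hypothetical setup: 1,000 simulated hitters, all with the same true average.
set.seed(1)
true_avg <- 0.260
n_players <- 1000

# Observed batting average = hits / at-bats, with hits drawn as binomial counts.
avg_small <- rbinom(n_players, size = 20, prob = true_avg) / 20
avg_large <- rbinom(n_players, size = 550, prob = true_avg) / 550

# Spread of the observed averages at each playing-time level.
sd_small <- sd(avg_small)
sd_large <- sd(avg_large)
c(sd_small = sd_small, sd_large = sd_large)
```

Even with identical underlying skill, the 20 at-bat averages scatter several times more widely than the 550 at-bat ones. That extra scatter is the noise that filter(AB >= 500) removes.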
Batting_1_numeric <- select(Batting_1, H, X2B, X3B, HR, RBI, SO)
correlation <- cor(Batting_1_numeric)
library(corrplot)
round(correlation, 3)
## H X2B X3B HR RBI SO
## H 1.000 0.495 0.181 0.133 0.329 -0.260
## X2B 0.495 1.000 -0.087 0.234 0.405 -0.052
## X3B 0.181 -0.087 1.000 -0.289 -0.292 -0.063
## HR 0.133 0.234 -0.289 1.000 0.845 0.467
## RBI 0.329 0.405 -0.292 0.845 1.000 0.265
## SO -0.260 -0.052 -0.063 0.467 0.265 1.000
corrplot(correlation, method = "circle")
lm command. Finally, interpret the slope and intercept of this line.
ggplot(data = Batting_1, aes(x = HR, y = RBI)) +
  geom_point() +
  geom_smooth(method = "lm")
fit_1 <- lm(RBI ~ HR, data = Batting_1)
summary(fit_1)
##
## Call:
## lm(formula = RBI ~ HR, data = Batting_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.167 -8.541 -0.716 8.177 52.352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.66305 0.60586 70.42 <2e-16 ***
## HR 1.81679 0.02565 70.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.51 on 2001 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7147
## F-statistic: 5015 on 1 and 2001 DF, p-value: < 2.2e-16
The slope says that each additional home run is associated with about 1.82 more runs batted in, on average. Since a home run always drives in at least the batter himself, the amount above 1 is in part a statement about the runners on base when a player homers: solo home runs are common, but some come with one, two, or three runners on. The intercept says that a player who hits no home runs in a season is expected to finish with about 43 RBIs (42.66).
With 47 home runs, the expected runs batted in for the season would be about 42.663 + 1.817 × 47 ≈ 128.05. This leaves a residual for Pete Alonso of 109 - 128.05 ≈ -19.05, which means that he drove in about 19 fewer runs than expected given the number of home runs he hit that season.
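The residual arithmetic can be reproduced directly from the coefficients in the summary output above. This is a small sketch: the variable names are illustrative, and the 47 home runs and 109 RBIs are the values used for Alonso in this homework. The unrounded coefficients give a prediction near 128.05; rounding them to 42.66 and 1.82 first gives 128.20 and a residual of -19.20 instead.

```r
# Coefficients copied from summary(fit_1).
intercept <- 42.66305
slope <- 1.81679

# Alonso's season as used in this homework.
alonso_hr <- 47
alonso_rbi <- 109

# Fitted value on the regression line, then observed minus fitted.
predicted_rbi <- intercept + slope * alonso_hr
residual_rbi <- alonso_rbi - predicted_rbi
round(c(predicted = predicted_rbi, residual = residual_rbi), 2)
```

With fit_1 in memory, predict(fit_1, newdata = data.frame(HR = 47)) should return the same fitted value.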
There could be multiple reasons why he has fewer RBIs than expected. One is that his team may not have put many runners on base ahead of him, so there were seldom baserunners when he hit his home runs. Many of his home runs would then have been solo shots, and the slope would overestimate the RBIs each of his home runs produced. Another is that Alonso may have been mostly a power hitter who was less efficient at driving in runs on anything other than home runs. In that case, the roughly 43 RBIs that the intercept attributes to production apart from home runs would be a bit high for him.
annotate command to add in a label (Alonso’s name, or a symbol) with where Alonso lies. Read more about annotate here: https://ggplot2.tidyverse.org/reference/annotate.html. Among players hitting Alonso’s number of home runs, is his RBI total surprising?
rbi_hr <- ggplot(data = Batting_1, aes(x = HR, y = RBI)) +
  geom_point() +
  geom_smooth(method = "lm")
rbi_hr + annotate("point", x = 47, y = 109, colour = "red")
When Alonso’s season is plotted on the scatter plot, his totals are not too surprising. His RBI total is a bit low for the number of home runs he hit, but his point is not an outlier: it sits within the cloud of players with similar home run totals and not far from the trend line.
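The prompt also mentions labeling the point with Alonso’s name. Here is a minimal sketch of that, using a small made-up data frame in place of Batting_1 so that it runs on its own. The toy values and the vjust offset are assumptions chosen for illustration.

```r
library(ggplot2)

# Made-up HR/RBI pairs standing in for Batting_1.
toy <- data.frame(HR = c(10, 20, 30, 40, 50), RBI = c(60, 80, 95, 115, 135))

p <- ggplot(data = toy, aes(x = HR, y = RBI)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  # Mark the season in red and write the name just below the point.
  annotate("point", x = 47, y = 109, colour = "red") +
  annotate("text", x = 47, y = 109, label = "Alonso", colour = "red", vjust = 1.8)
p
```

The same two annotate() layers can be added to rbi_hr in place of the toy plot.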
Batting_1 <- Batting_1 %>%
  mutate(K_rate = SO / (AB + BB))
ggplot(data = Batting_1, aes(x = K_rate)) +
  geom_density()
ggplot(data = Batting_1, aes(x = K_rate, colour = yearID, group = yearID)) +
  geom_density()
K_rate?
K_rate is the share of a batter's plate appearances that end in a strikeout. Here plate appearances are approximated by AB + BB, which leaves out hit-by-pitches and sacrifices.
K_rate: e.g, what is its center, shape, and spread
Looking at the distribution of K_rate, the density curve is centered at about 0.15. The values span roughly 0.05 to 0.35, a range of about 0.30. Since the longer tail extends toward the higher strikeout rates, the distribution appears skewed to the right.
K_rate has changed over the last two decades. Be precise. Have the center/shape/spread changed? If so, by how much?
Broadly, either pitchers are getting better, hitters are getting worse, or both. The spread stays at around 0.30 for most years. As the years progress, however, the shape of the K_rate distribution loses its skew and becomes much more symmetric. In addition, the center of the distribution shifts from about 0.15 to about 0.20, which suggests that batters are striking out more often than in years past.
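One way to be more precise than reading values off the density plot is to summarise center and spread by year. The sketch below uses a tiny made-up data frame (toy, with hypothetical values) so that it runs on its own. The same two tapply() calls could be pointed at Batting_1$K_rate and Batting_1$yearID.

```r
# Made-up strikeout rates for two seasons, standing in for Batting_1.
toy <- data.frame(
  yearID = c(2000, 2000, 2000, 2019, 2019, 2019),
  K_rate = c(0.10, 0.15, 0.20, 0.16, 0.20, 0.27)
)

# Center (median) and spread (interquartile range) within each year.
center_by_year <- tapply(toy$K_rate, toy$yearID, median)
spread_by_year <- tapply(toy$K_rate, toy$yearID, IQR)

# Change in the typical strikeout rate between the two seasons.
shift_in_center <- center_by_year[["2019"]] - center_by_year[["2000"]]
shift_in_center
```

Comparing the first and last years this way turns "the center moved" into a specific difference in medians, and the same comparison works for the spread.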
In the above example, we looked at strikeout rate – that is, the percentage of time that a player strikes out.
https://blogs.fangraphs.com/basic-hitting-metric-correlation-1955-2012-2002-2012/.
What rate metrics in baseball are most repeatable? Which metrics are least repeatable?
Based on the tables in the article, for 1955-2012 the most repeatable hitting metrics are SO/(AB+SF), SO/PA, and HR/(B-K+SF), while the least repeatable are BABIP, (3B+2B)/(H-HR), and 2B/PA. For 2002-2012, the most repeatable are Contact%, SwStr%, and SO/(AB+SF), while the least repeatable are BABIP, 2B/PA, and LD%.
Let’s assess the repeatability of the metrics in Batting_2, shown below:
Batting_2 <- Batting_1 %>%
  mutate(HR_rate = HR / (AB + BB),
         BB_rate = BB / (AB + BB),
         RBI_rate = RBI / (AB + BB))
Batting_2 <- Batting_2 %>%
  arrange(playerID, yearID) %>%
  group_by(playerID) %>%
  mutate(HR_rate_next = lead(HR_rate),
         K_rate_next = lead(K_rate),
         BB_rate_next = lead(BB_rate),
         RBI_rate_next = lead(RBI_rate)) %>%
  ungroup() %>%
  filter(!is.na(HR_rate_next))
Batting_2
Note: The code drops each player’s last season in the data set, since there is no following season to compare it against.
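Because Batting_1 keeps only seasons with at least 500 at-bats, the "next" row for a player is his next qualifying season, which is not always the next calendar year. The base-R sketch below mimics the grouped lead() step on a made-up data frame (toy, with hypothetical players a and b) and flags pairs that are not consecutive. shift_up() is an illustrative helper, not a dplyr function.

```r
# Made-up seasons: player "a" has no qualifying 2003 season.
toy <- data.frame(
  playerID = c("a", "a", "a", "b", "b"),
  yearID   = c(2001, 2002, 2004, 2010, 2011),
  HR_rate  = c(0.030, 0.040, 0.050, 0.020, 0.025)
)
toy <- toy[order(toy$playerID, toy$yearID), ]

# Shift a vector up by one position, padding the end with NA (what lead() does).
shift_up <- function(x) c(x[-1], NA)

# Apply the shift within each player, as group_by(playerID) + lead() would.
toy$HR_rate_next <- ave(toy$HR_rate, toy$playerID, FUN = shift_up)
toy$year_next <- ave(toy$yearID, toy$playerID, FUN = shift_up)

# TRUE only when the paired seasons are one calendar year apart.
toy$consecutive <- toy$year_next - toy$yearID == 1
toy
```

If strict year-to-year repeatability is wanted, one option is to keep only the rows where the two seasons are consecutive before computing correlations.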
BB_rate), HR rate, and RBI rate. That is, compare each metric in a players’ current year to the metric that he records in the following year. Which of these metrics is most repeatable? Which of these is least repeatable?
ggplot(data = Batting_2, aes(x = HR_rate, y = HR_rate_next)) + geom_point() + geom_smooth(method = "lm")
ggplot(data = Batting_2, aes(x = K_rate, y = K_rate_next)) + geom_point() + geom_smooth(method = "lm")
ggplot(data = Batting_2, aes(x = BB_rate, y = BB_rate_next)) + geom_point() + geom_smooth(method = "lm")
ggplot(data = Batting_2, aes(x = RBI_rate, y = RBI_rate_next)) + geom_point() + geom_smooth(method = "lm")
Batting_2_numeric <- Batting_2 %>%
  select(HR_rate_next, K_rate_next, BB_rate_next, RBI_rate_next,
         HR_rate, K_rate, BB_rate, RBI_rate)
cor_matrix <- cor(Batting_2_numeric, use = "pairwise.complete.obs")
cor_matrix
## HR_rate_next K_rate_next BB_rate_next RBI_rate_next
## HR_rate_next 1.0000000 0.4379130 0.4394466 0.8318930
## K_rate_next 0.4379130 1.0000000 0.2640946 0.2393048
## BB_rate_next 0.4394466 0.2640946 1.0000000 0.3629798
## RBI_rate_next 0.8318930 0.2393048 0.3629798 1.0000000
## HR_rate 0.7290753 0.3921259 0.4534374 0.6525709
## K_rate 0.4262219 0.8550500 0.2672502 0.2592676
## BB_rate 0.3763968 0.2338264 0.7839615 0.3468802
## RBI_rate 0.6345370 0.2241243 0.3832183 0.6771140
## HR_rate K_rate BB_rate RBI_rate
## HR_rate_next 0.7290753 0.4262219 0.3763968 0.6345370
## K_rate_next 0.3921259 0.8550500 0.2338264 0.2241243
## BB_rate_next 0.4534374 0.2672502 0.7839615 0.3832183
## RBI_rate_next 0.6525709 0.2592676 0.3468802 0.6771140
## HR_rate 1.0000000 0.4185328 0.4454445 0.8367687
## K_rate 0.4185328 1.0000000 0.2300630 0.2270375
## BB_rate 0.4454445 0.2300630 1.0000000 0.3799765
## RBI_rate 0.8367687 0.2270375 0.3799765 1.0000000
From the correlations above, K_rate (0.855) and BB_rate (0.784) have the strongest association with their own next-year values, while HR_rate (0.729) and RBI_rate (0.677) are weaker. K_rate is therefore the most repeatable of the four metrics and RBI_rate the least.
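The repeatability of each metric is the correlation between a rate and its own _next column. As a small sketch, the four values below are copied from the matrix above and ranked; the vector name is illustrative.

```r
# Year-to-year correlations copied from cor_matrix (each rate vs. its _next column).
repeatability <- c(
  K_rate   = 0.8550500,
  BB_rate  = 0.7839615,
  HR_rate  = 0.7290753,
  RBI_rate = 0.6771140
)

# Rank from most to least repeatable.
ranked <- sort(repeatability, decreasing = TRUE)
round(ranked, 3)
```

With Batting_2 in memory, a single value can be computed directly, for example cor(Batting_2$K_rate, Batting_2$K_rate_next).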
Batting_2 %>%
  summarise(mae_k_rate = mean(abs(K_rate - K_rate_next)),
            mse_k_rate = mean((K_rate - K_rate_next)^2))
Interpret the mae_k_rate above. How does this number relate to the scatter plot (using K_rate) in Question No. 10?
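Mean absolute error here is the average absolute difference between a player's K_rate in one season and his K_rate in the next, so this season's rate is in effect being used as the prediction for next season's. A value of about 0.0217 means the two typically differ by about 2 percentage points. On the K_rate scatter plot, this is the average vertical distance of the points from the line y = x. The value is small relative to the roughly 0.30 range of K_rate, which agrees with our finding that K_rate was the most repeatable of the four metrics reviewed.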
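To make the two summaries concrete, here is a minimal worked example on three made-up pairs of current and next-season strikeout rates (hypothetical values, not taken from Batting_2).

```r
# Made-up current and next-season strikeout rates for three players.
k_now  <- c(0.15, 0.20, 0.10)
k_next <- c(0.17, 0.19, 0.13)

# Absolute gap for each player, then the two summaries used above.
abs_diff <- abs(k_now - k_next)
mae <- mean(abs_diff)
mse <- mean((k_now - k_next)^2)
c(mae = mae, mse = mse)
```

Because MSE squares each gap, the largest miss (0.03) contributes more than half of it, whereas MAE weights every gap in proportion to its size.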
HR_rate instead of K_rate.
Batting_2 %>%
  summarise(mae_hr_rate = mean(abs(HR_rate - HR_rate_next)),
            mse_hr_rate = mean((HR_rate - HR_rate_next)^2))
ggplot(data = Batting_2, aes(x = yearID, y = HR_rate)) +
  geom_line(aes(group = playerID), colour = "grey") +
  geom_point(aes(group = playerID), colour = "grey") +
  geom_smooth()
The code above plots each player's HR_rate across seasons: the grey lines and points trace individual players, and the smoothed curve shows the overall trend by year.
ggplot(data = Batting_2, aes(x = yearID, y = K_rate)) +
  geom_line(aes(group = playerID), colour = "grey") +
  geom_point(aes(group = playerID), colour = "grey") +
  geom_smooth()
From the plot shown above, it appears that on average hitters are striking out more often in recent years than they were in the early 2000s. In the plot of home run rate, by contrast, the trend line moves up and down somewhat but settles at about the same level in recent years as in the early 2000s. One interesting detail is that between 2013 and 2015 home run rate reached its lowest point, while strikeout rate appeared to rise slightly faster than in the years before.